Tacotron 2
arXiv: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3121510%2F0bae893b-2454-1bfe-3155-9f918c64d7c0.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=b50ace47afb9c26eda68ee4d5da621a3
https://github.com/NVIDIA/tacotron2 Implementation (unofficial repo, NVIDIA): Tacotron 2 (without WaveNet)
Tacotron 2, a neural network architecture for speech synthesis directly from text.
The system is composed of a recurrent sequence-to-sequence feature prediction network
that maps character embeddings to mel-scale spectrograms,
followed by a modified WaveNet model
acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech.
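The two-stage structure described in the abstract can be sketched as a data flow: text, then character IDs, then mel frames, then audio samples. This is only a rough shape-level sketch, not the paper's or NVIDIA's implementation. The function names (`encode_text`, `predict_mel`, `vocode`), the fixed `frames_per_char`, and the constants (24 kHz audio, 12.5 ms frame shift, 80 mel channels) are assumptions for illustration, and both networks are replaced by placeholders that return zeros.

```python
# Assumed values for illustration only; real settings live in the model's config.
SAMPLE_RATE = 24000      # audio samples per second
FRAME_SHIFT_MS = 12.5    # time between consecutive mel frames
N_MELS = 80              # mel channels per frame
HOP_LENGTH = int(SAMPLE_RATE * FRAME_SHIFT_MS / 1000)  # audio samples per mel frame


def encode_text(text):
    """Map each character to an integer ID using a hypothetical symbol table."""
    symbols = sorted(set(text))
    table = {ch: i for i, ch in enumerate(symbols)}
    return [table[ch] for ch in text]


def predict_mel(char_ids, frames_per_char=5):
    """Placeholder for the seq2seq feature prediction network.

    A real model decides the number of frames itself; here a fixed,
    made-up number of frames per character is used and every frame is zeros.
    """
    n_frames = len(char_ids) * frames_per_char
    return [[0.0] * N_MELS for _ in range(n_frames)]


def vocode(mel_frames):
    """Placeholder for the vocoder: emits one hop of (silent) samples per mel frame."""
    return [0.0] * (len(mel_frames) * HOP_LENGTH)


char_ids = encode_text("hello")
mel = predict_mel(char_ids)
audio = vocode(mel)
print(len(char_ids), "characters ->", len(mel), "mel frames x", len(mel[0]),
      "channels ->", len(audio), "audio samples")
```

The point of the split is that the mel spectrogram is the only interface between the two stages, which is why the vocoder can be swapped out without touching the feature prediction network.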
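The notes below are probably wrong; I will fix them once I have read the paper properly.
The base architecture for TTS models from a little while back?
Uses WaveNet as the vocoder
Nowadays it would probably be better to replace it with WaveGlow
Is this what improves the score over the original Tacotron?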
The conditioning input is changed to mel spectrograms
Originally, WaveNet was conditioned on linguistic, duration, and F0 features (I don't fully understand these)
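To get a feel for what "mel" means in the conditioning input, here is a small sketch of the mel frequency scale. It uses the common HTK-style formula, which is an assumption on my part (other mel definitions exist, and the one used by a given implementation may differ). The band count and frequency range (80 bands, 125 Hz to 7600 Hz) are likewise example values, and `mel_center_frequencies` is a hypothetical helper, not a library function.

```python
import math


def hz_to_mel(hz):
    """Convert frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, f_min, f_max):
    """Centre frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)  # n_mels + 2 edge points, centres are the interior ones
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


centers = mel_center_frequencies(80, 125.0, 7600.0)
print("lowest bands (Hz): ", [round(f) for f in centers[:3]])
print("highest bands (Hz):", [round(f) for f in centers[-3:]])
```

The bands are packed closely at low frequencies and spread out at high frequencies, so a mel spectrogram is a compact, perceptually motivated summary of the audio rather than a full-resolution spectrum.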
MOS (Mean Opinion Score): 4.53
For comparison, professionally recorded speech has a MOS of 4.58
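As a reminder of what these numbers are, a MOS is the arithmetic mean of listeners' ratings. A minimal sketch follows, assuming ratings on a 1 to 5 scale and a normal-approximation 95% confidence interval. The ratings are made up for illustration, and `mean_opinion_score` is a hypothetical helper, not something from the paper or the repository.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating out of range 1..5: {r}")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on each side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings, not data from the paper.
ratings = [5, 4.5, 4, 5, 4.5, 4, 5, 4.5]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

With an interval attached, a gap as small as 4.53 versus 4.58 is easier to judge: whether it matters depends on how wide the intervals are.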
FYI
https://qiita.com/atsushi11o7/items/ea659641c354b4001fe4 Trying to explain the Tacotron2 implementation (article in Japanese)
https://github.com/Vaibhavs10/open-tts-tracker Vaibhavs10/open-tts-tracker